Machine learning-based risk factor analysis of necrotizing enterocolitis in very low birth weight infants

This study used machine learning and a national prospective cohort registry database to analyze the major risk factors of necrotizing enterocolitis (NEC) in very low birth weight (VLBW) infants, including environmental factors. The data consisted of 10,353 VLBW infants from the Korean Neonatal Network database from January 2013 to December 2017. The dependent variable was NEC. Seventy-four predictors, including ambient temperature and particulate matter, were included. An artificial neural network, decision tree, logistic regression, naïve Bayes, random forest, and support vector machine were used to evaluate the major predictors of NEC. Among the six prediction models, logistic regression and random forest had the best performance (accuracy: 0.93 and 0.93, area under the receiver-operating-characteristic curve: 0.73 and 0.72, respectively). According to random forest variable importance, major predictors of NEC were birth weight, birth weight Z-score, maternal age, gestational age, average birth year temperature, birth year, minimum birth year temperature, maximum birth year temperature, sepsis, and male sex. To the best of our knowledge, the performance of random forest in this study was among the highest in this line of research. NEC is strongly associated with ambient birth year temperature, as well as maternal and neonatal predictors.

www.nature.com/scientificreports/
Table 2) included, TABM-S: 4 variables (ambient temperature average, minimum and maximum for birth month as well as sepsis in Table 2) included, TUNING: The hyper-parameters of the random forest and the artificial neural network are tuned for TABM-S (e.g., RF-500 and ANN-20 represent the random forest with 500 trees and the artificial neural network with two hidden layers of size 20, respectively).
the last box of Table 3 show that the random forests with 500, 400, 300, 200 and 100 trees were not as good as the random forest with 1000 trees. The areas under the receiver-operating-characteristic curves for the six prediction models in one of the 50 runs are presented in Fig. 1. The results in Fig. 1 came from one particular run (i.e., the 50th run), whereas the results in Table 3 are the averages of the 50 runs. This explains why they differ from each other. The values and ranks of random forest variable importance are presented in Table 4. The importance rank of the temperature average for each of the 10, 9, 8, …, 2, 1, and 0 months before birth was below the top 30, while their sepsis counterparts were within the top 10 (9th). According to the random forest variable importance in Table 4 and Table 5, major predictors of NEC were sepsis, BW Z-score, gestational diabetes mellitus, PDA ligation, unmarried, pulmonary hemorrhage, sex (male), maximum birth year temperature, air leak syndrome, chorioamnionitis, small-for-GA, blood gas base excess, GA, in vitro fertilization, and antenatal steroid. It should be noted that the results in Tables 4 and 5 came from one particular run (i.e., the 50th run).
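The difference between a single-run result (as in Fig. 1) and the 50-run average (as in Table 3) can be illustrated with a minimal Python sketch. It is not the authors' R code: the data are synthetic, the single feature and its distribution are invented for illustration, and the "model" is a toy rule that only learns the direction of one feature from the training split. The function names `auc`, `one_run`, and `make_synthetic` are hypothetical.

```python
import random
from statistics import mean


def auc(labels, scores):
    """Area under the ROC curve via the pairwise (Mann-Whitney) identity."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one case and one non-case")
    # Count case/non-case pairs ranked correctly; ties count as one half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def one_run(rows, seed, test_fraction=0.30):
    """One random 70:30 split: fit on the training part, score the rest."""
    rng = random.Random(seed)  # a different seed for each run
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]
    # Toy "model" (assumption): learn only whether cases have a higher
    # or lower feature mean than non-cases in the training set.
    case_mean = mean(x for x, y in train if y == 1)
    ctrl_mean = mean(x for x, y in train if y == 0)
    sign = 1.0 if case_mean > ctrl_mean else -1.0
    return auc([y for _, y in test], [sign * x for x, _ in test])


def make_synthetic(n=2000, prevalence=0.068, seed=0):
    """Synthetic (feature, outcome) rows; all numbers are illustrative."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        y = 1 if rng.random() < prevalence else 0
        x = rng.gauss(1000.0 if y else 1150.0, 250.0)
        rows.append((x, y))
    return rows


rows = make_synthetic()
run_aucs = [one_run(rows, seed) for seed in range(50)]
print(f"50th run: {run_aucs[-1]:.3f}   mean of 50 runs: {mean(run_aucs):.3f}")
```

Each run evaluates a different random 30% of the data, so the AUC of any one run scatters around the 50-run mean rather than matching it.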

Discussion
Among the six prediction models for NEC, logistic regression and random forest had the best performances. According to random forest variable importance, major predictors of NEC included environmental factors (ambient birth year temperature), maternal factors (maternal age, multipara, multiple pregnancy, chorioamnionitis), and neonatal factors (GA, BW, male sex, sepsis, PDA). This study confirmed that BW and GA were the main predictors of NEC. Our findings were consistent with the results of previous studies that revealed that lower BW and GA were the main risk factors for NEC 19,20. Prematurity is well known to be the main cause of NEC. This can be explained by ischemic mucosal injury in the immature gut of preterm infants 21. Recently, NEC has been considered to develop as multifactorial hits in the immature gut by both prenatal and postnatal factors. In addition, the gut microbiota in preterm infants is different from that in healthy term infants, and shows decreased diversity 22,23. Moreover, prematurity reflects developmental changes in several organs other than the gut, which increases the incidence of neonatal morbidity.
A unique finding of this study was that ambient temperature was associated with the incidence of NEC. The higher ambient temperature associated with NEC incidence may be influenced by environmental factors. Previous studies have reported that a high ambient temperature increases the risk of preterm birth [24][25][26]. Heat induces the production of proinflammatory cytokines such as interleukin (IL)-1, IL-6, and tumor necrosis factor, causing inflammatory processes at the maternal-fetal interface 27. Furthermore, heat stress increases the production of oxytocin and prostaglandin, which are associated with uterine contractions and induce preterm labor 28,29. Heat also causes dehydration, which decreases maternal fluid levels, subsequently reducing fetal blood volume and leading to the production of pituitary hormones that provoke labor 30.
Sepsis is one of the main predictors of NEC. Infection triggers inflammation in the immature gastrointestinal tract, which may contribute to NEC pathogenesis 31. Recent findings have shown that preterm infants are exposed to a bacteria-rich environment in the neonatal intensive care unit and antibiotics that reduce the diversity of the gut microbiome 32. Toll-like receptor 4 (TLR4) is a pathogen recognition molecule that recognizes bacterial endotoxins such as lipopolysaccharides and induces inflammation 33. This TLR4-mediated bacterial signaling leads to increased mucosal injury and reduced mucosal repair, resulting in mucosal defects in which bacteria can translocate through the circulation [34][35][36]. At this stage, bacteria lead to the inhibition of vasodilator expression, thus decreasing intestinal perfusion, which results in tissue necrosis of the gut 37.
In this study, chorioamnionitis was found to be a predictor of NEC. There have been debates regarding prenatal infection or inflammation and its effects on NEC. Some studies reported no association, but others demonstrated that chorioamnionitis was associated with preterm birth, and it was also associated with inflammation and infection in infants during perinatal periods [38][39][40] . A meta-analysis by Been et al. revealed that chorioamnionitis is significantly associated with NEC 41 . Our findings are consistent with the results of these studies. Gastrointestinal inflammatory markers were increased in preterm infants exposed to chorioamnionitis, reflecting the proinflammatory state of the gut after birth 42 . The gut microbiome reflects amniotic fluid with chorioamnionitis 43 . In this condition, preterm infants may have disturbed barrier function, which would increase the susceptibility of the gut to secondary hits, such as sepsis and circulatory instability, leading to an increased incidence of NEC 41 .
In this study, multiparity was significantly associated with NEC. Lee et al. reported similar results in VLBW infants 40. This association may arise because the infant can be affected by maternal parity, exposure to maternal stress factors from recurrent pregnancy, oxidative stress, and passive transfer of immunomodulators that change the gut microbiota of neonates.
There are some limitations to this study. First, address information was not provided in the Korean Neonatal Network (KNN) database; hence, national averages were taken for PM10 and temperature variables in this study. More specific information on these predictors would improve the validity of research in this direction. Second, this study did not consider the possible mediating effects of the various predictors. Third, this study did not focus on examining the possible mechanisms between major predictors and NEC. Fourth, this study did not include indoor factors that could be major predictors of NEC. Fifth, it was beyond the scope of this study to compare various re-sampling approaches for class imbalance (the proportion of NEC was only 6.8%). Under-sampling reduces the majority class to achieve balance, whereas over-sampling expands the minority class for the same goal. For example, a recent study compared the performance measures of four machine learning models in the cases of under-sampling and over-sampling for the prediction of cardiovascular disease 44. Few studies are available, and further investigation is needed on this topic. Sixth, maternal age, GA, BW, BW Z-score and environmental predictors were not normalized in order to keep their full information. Using different rescaling methods for these continuous predictors (e.g., normalization) and comparing their results would make a valuable contribution to this line of research. Seventh, this study followed existing literature 49,53,54 to focus on the top-10 predictors in terms of random forest variable importance. However, it should be noted that there has been no consensus on the threshold of major predictors in terms of random forest variable importance. Eighth, this study focused on random forest variable importance instead of logistic regression variable importance. Logistic regression performed as well as the random forest in this study, but it requires an unrealistic assumption of ceteris paribus, i.e., "all the other variables staying constant." For this reason, we used random forest variable importance for evaluating the importance ranking of a major predictor and univariate analysis for testing the direction of association between NEC and the predictor. Some predictors ranked within the top 15 in the random forest but out of the top 30 in logistic regression, i.e., BW (1st vs. 63rd), maternal age (3rd vs. 52nd), average birth year temperature (5th vs. 56th), birth year (6th vs. 65th), primipara (11th vs. 33rd) and surfactant use (12th vs. 40th). Little literature is available and more examination is needed on comparing the variable importance of various statistical approaches.
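The two re-sampling strategies named in the fifth limitation can be sketched in a few lines of Python. This is an illustrative sketch only, since this study did not apply re-sampling: the helper names `under_sample` and `over_sample` are hypothetical, the rows are made-up (id, outcome) pairs, and only the 6.8% case proportion is taken from the text. In practice such re-sampling is normally applied to the training split only, so that the validation set keeps the original class proportions.

```python
import random


def _split_classes(rows, label_of):
    """Return (minority, majority) lists for a binary outcome coded 0/1."""
    pos = [r for r in rows if label_of(r) == 1]
    neg = [r for r in rows if label_of(r) == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    if not minority:
        raise ValueError("both classes must be present")
    return minority, majority


def under_sample(rows, label_of, seed=0):
    """Randomly drop majority-class rows until the classes are equal in size."""
    rng = random.Random(seed)
    minority, majority = _split_classes(rows, label_of)
    return minority + rng.sample(majority, len(minority))


def over_sample(rows, label_of, seed=0):
    """Randomly duplicate minority-class rows until the classes are equal in size."""
    rng = random.Random(seed)
    minority, majority = _split_classes(rows, label_of)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return majority + minority + extra


def label_of(row):
    return row[1]


# 1000 made-up rows with 68 cases (6.8%), coded as (id, outcome).
rows = [(i, 1 if i < 68 else 0) for i in range(1000)]
under = under_sample(rows, label_of)
over = over_sample(rows, label_of)
print(len(under), len(over))  # 136 1864
```

Under-sampling discards most of the majority class (here 864 of 932 non-cases), whereas over-sampling keeps every row but repeats minority rows, which is the trade-off a comparison of the two approaches would need to weigh.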
To the best of our knowledge, the performance of the random forest in this study (the area under the receiver operating characteristic curve of 0.72) is among the highest in this line of research. NEC is strongly associated with birth year temperature, as well as maternal and neonatal predictors.

Methods
Participants and variables. The data consisted of 10,353 VLBW infants from the KNN database from January 2013 to December 2017. The KNN started in April 2013 as a national prospective cohort registry of VLBW infants admitted or transferred to neonatal intensive care units across South Korea (it covers 74 neonatal intensive care units now). It collects perinatal and neonatal data of VLBW infants based on a standardized operating procedure 45 .
The dependent variable was NEC, with binary categories (no, yes). The following 47 perinatal predictors were considered (43 of them had binary categories): sex, birth-year (categorical: 2013, 2014, 2015, 2016, 2017), birth-month, birth-season (spring, summer, autumn, winter), multiple pregnancy, in vitro fertilization, gestational diabetes mellitus, overt diabetes mellitus, pregnancy-induced hypertension, chronic hypertension, histologic chorioamnionitis, pre-labor rupture of membranes > 18 h, antenatal steroid, cesarean section, oligohydramnios, polyhydramnios, maternal age (years), primipara, maternal education (categorical: elementary, junior high, senior high, college or higher), maternal citizenship, paternal education (categorical: elementary, junior high, senior high, college or higher), paternal citizenship, marital status, congenital infection, 1-min Apgar score ≤ 3, 5-min Apgar score < 7, neonatal resuscitation program, intensive neonatal resuscitation (intubation, chest compression   46 . Gestational diabetes mellitus was defined as any degree of glucose intolerance with the onset or first recognition during pregnancy. Pregnancy-induced hypertension was defined as hypertension with onset in the latter part of pregnancy (> 20 weeks' gestation), followed by normalization of blood pressure postpartum. Chorioamnionitis was defined as histologic chorioamnionitis 47. Oligohydramnios (or polyhydramnios) was defined as an amniotic fluid index of < 5 cm (or > 24 cm). Small-for-GA was defined as BW below the 10th percentile, according to the Fenton growth chart 48.
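The threshold-based definitions above can be written as small predicates, which makes the boundary cases explicit (for example, an amniotic fluid index of exactly 5 cm is not oligohydramnios). The following Python sketch is illustrative only: the function names and the "neither" label are hypothetical, and the BW percentile is assumed to have been looked up from the Fenton growth chart beforehand, a step not shown here.

```python
def amniotic_fluid_category(afi_cm):
    """Classify an amniotic fluid index (cm) using the stated cut-offs."""
    if afi_cm < 5:
        return "oligohydramnios"
    if afi_cm > 24:
        return "polyhydramnios"
    return "neither"  # hypothetical label for values from 5 to 24 cm


def is_small_for_ga(bw_percentile):
    """Small-for-GA: BW below the 10th percentile (percentile given as input)."""
    return bw_percentile < 10


def low_apgar_1min(score):
    """Binary predictor: 1-min Apgar score <= 3."""
    return score <= 3


def low_apgar_5min(score):
    """Binary predictor: 5-min Apgar score < 7."""
    return score < 7


print(amniotic_fluid_category(4.9), amniotic_fluid_category(5), is_small_for_ga(10))
```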
Statistical analysis. Artificial neural networks, decision trees, logistic regression, naïve Bayes, random forests, and support vector machines were used for predicting NEC [49][50][51][52][53][54]. The following default parameters were adopted for convenience: the splitting criterion was GINI, the maximum depth was not set and the number of trees was 1000 in the random forest; the radial basis function kernel was employed in the support vector machine; and the limited memory Broyden-Fletcher-Goldfarb-Shanno algorithm served for the optimization of the artificial neural network. Data on 10,353 observations with full information were divided into training and validation sets in a 70:30 ratio. Accuracy, which is the ratio of correct predictions among 3,106 observations, was employed as the standard for validating the models. Random forest variable importance, the contribution of a certain variable to the performance (GINI) of the random forest, was used to examine the major predictors of NEC in VLBW infants, including environmental factors. The random split and analysis were repeated 50 times, and the average was used for external validation 55,56. Different seed numbers were used for different runs but the default parameters stayed the same throughout the random splits and analyses. R-Studio 1.3.959 (R-Studio Inc.: Boston, United States) was employed for the analysis from August 1, 2021 to September 30, 2021.
Ethical statement. The